Abstract

We present a procedure for effective estimation of entropy and mutual information
from small-sample data, and apply it to the problem of inferring high-dimensional
gene association networks. Specifically, we develop a James-Stein-type shrinkage
estimator, resulting in a procedure that is highly efficient statistically as well as
computationally. Despite its simplicity, we show that it outperforms eight other
entropy estimation procedures across a diverse range of sampling scenarios and
data-generating models, even in cases of severe undersampling. We illustrate the
approach by analyzing E. coli gene expression data and computing an entropy-based
gene-association network from these data. A computer program is available that
implements the proposed shrinkage estimator.

1 Introduction



Entropy is a fundamental quantity in statistics and machine learning. It has a large
number of applications, for example in astronomy, cryptography, signal processing,
statistics, physics, image analysis, neuroscience, network theory, and bioinformatics;
see, for example, Stinson (2006), Yeo and Burge (2004), MacKay (2003) and Strong et al.
(1998). Here we focus on estimating entropy from small-sample data, with applications
in genomics and gene network inference in mind (Margolin et al. 2006; Meyer et al.
2007).

To define the Shannon entropy, consider a categorical random variable with alphabet
size $p$ and associated cell probabilities $\theta_1, \ldots, \theta_p$ with $\theta_k > 0$ and
$\sum_k \theta_k = 1$. Throughout the article, we assume that $p$ is fixed and known. In this
setting, the Shannon entropy in natural units is given by¹

$$H = -\sum_{k=1}^{p} \theta_k \log(\theta_k). \tag{1}$$

In practice, the underlying probability mass function is unknown, hence $H$ and $\theta_k$ need
to be estimated from observed cell counts $y_k \geq 0$.

A particularly simple and widely used estimator of entropy is the maximum likelihood
(ML) estimator

$$\hat{H}^{ML} = -\sum_{k=1}^{p} \hat\theta_k^{ML} \log(\hat\theta_k^{ML}),$$

constructed by plugging the ML frequency estimates

$$\hat\theta_k^{ML} = \frac{y_k}{n} \tag{2}$$

into Eq. 1, with $n = \sum_{k=1}^{p} y_k$ being the total number of counts.
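
The plugin construction of Eqs. 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names and the example counts are hypothetical, and the convention $0 \log 0 = 0$ is applied by skipping empty cells.

```python
import math

def freqs_ml(counts):
    """ML cell frequencies y_k / n (Eq. 2)."""
    n = sum(counts)
    return [y / n for y in counts]

def entropy_plugin(freqs):
    """Shannon entropy in nats (Eq. 1); cells with zero frequency are skipped (0 log 0 = 0)."""
    return -sum(f * math.log(f) for f in freqs if f > 0)

def entropy_ml(counts):
    """ML (plugin) entropy estimate from observed cell counts."""
    return entropy_plugin(freqs_ml(counts))

# Hypothetical example: p = 11 cells, n = 19 observations, three empty cells.
counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]
print(entropy_ml(counts))
```

Because empty cells contribute nothing, the estimate can never exceed the logarithm of the number of non-empty cells, which is one way to see why it underestimates the entropy when most cells are unobserved.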

In situations with $n \gg p$, that is, when the dimension is low and when there are
many observations, it is easy to infer entropy reliably, and it is well known that in this
case the ML estimator is optimal. However, in high-dimensional problems with $n \ll p$
it becomes extremely challenging to estimate the entropy. Specifically, in the "small n,
large p" regime the ML estimator performs very poorly and severely underestimates the
true entropy.

While entropy estimation has a long history tracing back more than 50 years,
it is only recently that the specific issues arising in high-dimensional, undersampled
data sets have attracted attention. This has led to two recent innovations, namely the
NSB algorithm (Nemenman et al. 2002) and the Chao-Shen estimator (Chao and Shen
2003), both of which are now widely considered as benchmarks for the small-sample
entropy estimation problem (Vu et al. 2007).

Here, we introduce a novel and highly efficient small-sample entropy estimator
based on James-Stein shrinkage (Gruber 1998). Our method is fully analytic and hence
computationally inexpensive. Moreover, our procedure simultaneously provides estimates
of the entropy and of the cell frequencies suitable for plugging into the Shannon
entropy formula (Eq. 1). Thus, the estimator we propose is simpler, very efficient, and
at the same time more versatile than currently available entropy estimators.

¹ In this paper we use the following conventions: log denotes the natural logarithm (not
base 2 or base 10), and we define $0 \log 0 = 0$.



2 Conventional Methods for Estimating Entropy 

Entropy estimators can be divided into two groups: i) methods that rely on estimates
of cell frequencies, and ii) estimators that directly infer entropy without estimating a
compatible set of $\theta_k$. Most methods discussed below fall into the first group, except for
the Miller-Madow and NSB approaches.



2.1 Maximum Likelihood Estimate 

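The connection between observed counts $y_k$ and frequencies $\theta_k$ is given by the
multinomial distribution

$$\mathrm{Prob}(y_1, \ldots, y_p; \theta_1, \ldots, \theta_p) = \frac{n!}{\prod_{k=1}^{p} y_k!} \prod_{k=1}^{p} \theta_k^{y_k}. \tag{3}$$

Note that $\theta_k > 0$ because otherwise the distribution is singular. In contrast, there may
be (and often are) zero counts $y_k$. The ML estimator of $\theta_k$ maximizes the right hand
side of Eq. 3 for fixed $y_k$, leading to the observed frequencies $\hat\theta_k^{ML} = y_k/n$ with variances
$\mathrm{Var}(\hat\theta_k^{ML}) = \frac{1}{n}\theta_k(1-\theta_k)$ and $\mathrm{Bias}(\hat\theta_k^{ML}) = 0$ as $E(\hat\theta_k^{ML}) = \theta_k$.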



2.2 Miller-Madow Estimator 

While $\hat\theta_k^{ML}$ is unbiased, the corresponding plugin entropy estimator $\hat{H}^{ML}$ is not. First
order bias correction leads to

$$\hat{H}^{MM} = \hat{H}^{ML} + \frac{m_{>0} - 1}{2n},$$

where $m_{>0}$ is the number of cells with $y_k > 0$. This is known as the Miller-Madow
estimator (Miller 1955).
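
As a minimal sketch, the Miller-Madow correction only adds one term to the ML estimate. The function name below is hypothetical and the code is an illustration under the definitions given above, not a reference implementation.

```python
import math

def entropy_miller_madow(counts):
    """Miller-Madow entropy estimate in nats: ML estimate plus (m_{>0} - 1) / (2n)."""
    n = sum(counts)
    # ML plugin entropy; empty cells are skipped (0 log 0 = 0)
    h_ml = -sum((y / n) * math.log(y / n) for y in counts if y > 0)
    m_pos = sum(1 for y in counts if y > 0)  # number of non-empty cells
    return h_ml + (m_pos - 1) / (2 * n)

print(entropy_miller_madow([4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]))
```

Note that the correction depends only on the number of observed cells, so it cannot compensate for probability mass sitting in cells that were never sampled.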



2.3 Bayesian Estimators 

Bayesian regularization of cell counts may lead to vast improvements over the ML
estimator (Agresti and Hitchcock 2005). Using the Dirichlet distribution with parameters
$a_1, a_2, \ldots, a_p$ as prior, the resulting posterior distribution is also Dirichlet with mean

$$\hat\theta_k^{Bayes} = \frac{y_k + a_k}{n + A},$$

where $A = \sum_{k=1}^{p} a_k$. The flattening constants $a_k$ play the role of pseudo-counts (compare
with Eq. 2), so that $A$ may be interpreted as the a priori sample size.

Some common choices for $a_k$ are listed in Tab. 1 along with references to the
corresponding plugin entropy estimators,

$$\hat{H}^{Bayes} = -\sum_{k=1}^{p} \hat\theta_k^{Bayes} \log(\hat\theta_k^{Bayes}).$$



While the multinomial model with Dirichlet prior is standard Bayesian folklore (Gelman
et al. 2004), there is no general agreement regarding which assignment of $a_k$ is best as
noninformative prior; see for instance the discussion in Tuyl et al. (2008) and Geisser
(1984). But, as shown later in this article, choosing inappropriate $a_k$ can easily cause
the resulting estimator to perform worse than the ML estimator, thereby defeating the
originally intended purpose.
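
The effect of the flattening constant can be explored with a short sketch. It assumes a common pseudo-count $a_k = a$ for all cells, so that $A = p \cdot a$; the function names and the example counts are hypothetical, and the value sqrt(n)/p used for the minimax prior is stated here as an assumption about the usual convention.

```python
import math

def freqs_dirichlet(counts, a):
    """Posterior mean frequencies (y_k + a) / (n + A) with a common pseudo-count a, A = p * a."""
    n, p = sum(counts), len(counts)
    return [(y + a) / (n + p * a) for y in counts]

def entropy_dirichlet(counts, a):
    """Plugin entropy in nats based on the Dirichlet-regularized frequencies."""
    return -sum(f * math.log(f) for f in freqs_dirichlet(counts, a) if f > 0)

counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]   # hypothetical counts
n, p = sum(counts), len(counts)
pseudo_counts = {
    "no prior (ML)": 0.0,
    "Jeffreys": 0.5,
    "Bayes-Laplace": 1.0,
    "Perks": 1.0 / p,
    "minimax": math.sqrt(n) / p,   # assumption: sqrt(n)/p convention
}
for name, a in pseudo_counts.items():
    print(name, round(entropy_dirichlet(counts, a), 4))
```

Larger pseudo-counts pull the frequencies towards the uniform distribution and therefore raise the entropy estimate, which helps when the truth is close to uniform and hurts when it is sparse.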



2.4 NSB Estimator 



The NSB approach (Nemenman et al. 2002) avoids overrelying on a particular choice
of $a_k$ in the Bayes estimator by using a more refined prior. Specifically, a Dirichlet
mixture prior with an infinite number of components is employed, constructed such that
the resulting prior over the entropy is uniform. While the NSB estimator is one of the
best entropy estimators available at present in terms of statistical properties, using the
Dirichlet mixture prior is computationally expensive and somewhat slow for practical
applications.



2.5 Chao-Shen Estimator 



Another recently proposed estimator is due to Chao and Shen (2003). This approach
applies the Horvitz-Thompson estimator (Horvitz and Thompson 1952) in combination
with the Good-Turing correction (Good 1953; Orlitsky et al. 2003) of the empirical cell
probabilities to the problem of entropy estimation. The Good-Turing-corrected frequency
estimates are

$$\hat\theta_k^{GT} = \left(1 - \frac{m_1}{n}\right) \hat\theta_k^{ML},$$

where $m_1$ is the number of singletons, that is, cells with $y_k = 1$. Used jointly with the
Horvitz-Thompson estimator this results in

$$\hat{H}^{CS} = -\sum_{k=1}^{p} \frac{\hat\theta_k^{GT} \log(\hat\theta_k^{GT})}{1 - (1 - \hat\theta_k^{GT})^n},$$

an estimator with remarkably good statistical properties (Vu et al. 2007).

Table 1: Common choices for the parameters of the Dirichlet prior in the Bayesian
estimators of cell frequencies, and corresponding entropy estimators.

a_k         Cell frequency prior              Entropy estimator
0           no prior                          maximum likelihood
1/2         Jeffreys prior (Jeffreys 1946)    Krichevsky and Trofimov (1981)
1           Bayes-Laplace uniform prior       Holste et al. (1998)
1/p         Perks prior (Perks 1947)          Schürmann and Grassberger (1996)
sqrt(n)/p   minimax prior (Trybula 1958)
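
The Chao-Shen estimator of Section 2.5 combines the two corrections in a single pass over the observed cells. The following is a minimal sketch with a hypothetical function name; the guard for the case in which every observation is a singleton is an assumption added here to keep the coverage estimate positive and is not taken from the text.

```python
import math

def entropy_chao_shen(counts):
    """Chao-Shen entropy estimate in nats from a list of cell counts."""
    observed = [y for y in counts if y > 0]      # empty cells have theta^GT = 0 and contribute nothing
    n = sum(observed)
    m1 = sum(1 for y in observed if y == 1)      # number of singletons
    if m1 == n:
        m1 = n - 1                               # assumption: avoid a zero coverage estimate
    coverage = 1.0 - m1 / n                      # Good-Turing factor (1 - m1/n)
    h = 0.0
    for y in observed:
        theta_gt = coverage * y / n              # Good-Turing corrected frequency
        inclusion = 1.0 - (1.0 - theta_gt) ** n  # probability that the cell is seen at least once
        h -= theta_gt * math.log(theta_gt) / inclusion
    return h

print(entropy_chao_shen([4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]))
```

Dividing each term by its inclusion probability is the Horvitz-Thompson step: rarely observed cells receive a larger weight to stand in for similar cells that were missed entirely.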



3 A James-Stein Shrinkage Estimator 

The contribution of this paper is to introduce an entropy estimator that employs James- 
Stein-type shrinkage at the level of cell frequencies. As we will show below, this leads to 
an entropy estimator that is highly effective, both in terms of statistical accuracy and 
computational complexity. 

James-Stein-type shrinkage is a simple analytic device to perform regularized
high-dimensional inference. It is ideally suited for small-sample settings: the original
estimator (James and Stein 1961) considered sample size n = 1. A general recipe for
constructing shrinkage estimators is given in Appendix A. In this section, we describe
how this approach can be applied to the specific problem of estimating cell frequencies.

James-Stein shrinkage is based on averaging two very different models: a high- 
dimensional model with low bias and high variance, and a lower dimensional model 
with larger bias but smaller variance. The intensity of the regularization is determined 
by the relative weighting of the two models. Here we consider the convex combination 

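$$\hat\theta_k^{Shrink} = \lambda t_k + (1 - \lambda) \hat\theta_k^{ML}, \tag{4}$$

where $\lambda \in [0, 1]$ is the shrinkage intensity that takes on a value between 0 (no shrinkage)
and 1 (full shrinkage), and $t_k$ is the shrinkage target. A convenient choice of $t_k$ is the
uniform distribution $t_k = 1/p$. This is also the maximum entropy target. Considering that
$\mathrm{Bias}(\hat\theta_k^{ML}) = 0$ and using the unbiased estimator
$\widehat{\mathrm{Var}}(\hat\theta_k^{ML}) = \hat\theta_k^{ML}(1 - \hat\theta_k^{ML})/(n-1)$ we obtain (cf.
Appendix A) for the shrinkage intensity

$$\hat\lambda^{\star} = \frac{\sum_{k=1}^{p} \widehat{\mathrm{Var}}(\hat\theta_k^{ML})}{\sum_{k=1}^{p} (t_k - \hat\theta_k^{ML})^2} = \frac{1 - \sum_{k=1}^{p} (\hat\theta_k^{ML})^2}{(n-1) \sum_{k=1}^{p} (t_k - \hat\theta_k^{ML})^2}. \tag{5}$$

Note that this also assumes a non-stochastic target $t_k$. The resulting plugin shrinkage
entropy estimate is

$$\hat{H}^{Shrink} = -\sum_{k=1}^{p} \hat\theta_k^{Shrink} \log(\hat\theta_k^{Shrink}). \tag{6}$$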
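
Eqs. 4 to 6 translate almost directly into code. The sketch below uses the uniform target $t_k = 1/p$ and hypothetical function names; returning full shrinkage when $n \leq 1$ or when the observed frequencies already equal the target is an assumption made here to avoid a division by zero, and the intensity is truncated to the interval [0, 1].

```python
import math

def shrink_lambda(counts):
    """Estimated shrinkage intensity (Eq. 5) for the uniform target t_k = 1/p."""
    n, p = sum(counts), len(counts)
    u = [y / n for y in counts]                         # ML frequencies
    misfit = sum((1.0 / p - f) ** 2 for f in u)         # sum_k (t_k - theta_k^ML)^2
    if n <= 1 or misfit == 0.0:
        return 1.0                                      # assumption: shrink fully in degenerate cases
    lam = (1.0 - sum(f * f for f in u)) / ((n - 1) * misfit)
    return min(1.0, max(0.0, lam))                      # truncate to [0, 1]

def freqs_shrink(counts):
    """Shrinkage estimates of the cell frequencies (Eq. 4)."""
    n, p = sum(counts), len(counts)
    lam = shrink_lambda(counts)
    return [lam / p + (1.0 - lam) * y / n for y in counts]

def entropy_shrink(counts):
    """Plugin shrinkage entropy estimate in nats (Eq. 6)."""
    return -sum(f * math.log(f) for f in freqs_shrink(counts) if f > 0)

counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]              # hypothetical counts
print(shrink_lambda(counts), entropy_shrink(counts))
```

Whenever the intensity is positive, every cell receives a strictly positive frequency, including cells with zero counts, so the estimated frequencies can be plugged into Eq. 1 without special handling.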

Remark 1: 

There is a one to one correspondence between the shrinkage and the Bayes estimator.
If we write $t_k = a_k/A$ and $\lambda = A/(n+A)$, then $\hat\theta_k^{Shrink} = \hat\theta_k^{Bayes}$. This implies that the shrinkage
estimator is an empirical Bayes estimator with a data-driven choice of the flattening
constants; see also Efron and Morris (1973). For every choice of $A$ there exists an
equivalent shrinkage intensity $\lambda$. Conversely, for every $\lambda$ there exists an equivalent
$A = n\lambda/(1-\lambda)$.



Remark 2: 



Developing $A = n\lambda/(1-\lambda) = n(\lambda + \lambda^2 + \ldots)$ we obtain the approximate estimate $\hat{A} = n\hat\lambda$,
which in turn recovers the "pseudo-Bayes" estimator described in Fienberg and Holland
(1973).
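
The correspondence used in Remarks 1 and 2 can be verified by splitting the posterior mean into a prior part and a data part. The following lines are only a sketch of that step, using the symbols already defined above:

$$\hat\theta_k^{Bayes} = \frac{y_k + a_k}{n + A} = \frac{A}{n+A} \cdot \frac{a_k}{A} + \frac{n}{n+A} \cdot \frac{y_k}{n} = \lambda t_k + (1-\lambda)\hat\theta_k^{ML} \quad \text{with} \quad \lambda = \frac{A}{n+A}, \; t_k = \frac{a_k}{A}.$$

Solving $\lambda (n + A) = A$ for $A$ gives

$$A = \frac{n\lambda}{1-\lambda} = n \sum_{j \geq 1} \lambda^j \quad \text{for } 0 \leq \lambda < 1.$$

As a small numerical illustration, $n = 9$ and $\lambda = 0.1$ correspond to exactly $A = 1$ pseudo-count in total, whereas the first-order approximation $n\lambda$ gives 0.9. The approximation is therefore accurate only when the shrinkage intensity is small.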

Remark 3: 

The shrinkage estimator assumes a fixed and known p. In many practical applications 
this will indeed be the case, for example, if the observed counts are due to discretization 
(see also the data example). In addition, the shrinkage estimator appears to be robust 
against assuming a larger p than necessary (see scenario 3 in the simulations). 

Remark 4: 

The shrinkage approach can easily be modified to allow multiple targets with different
shrinkage intensities. For instance, using the Good-Turing estimator (Good 1953; Orlitsky
et al. 2003), one could set up a different uniform target for the non-zero and the zero
counts, respectively.



4 Comparative Evaluation of Statistical Properties 

In order to elucidate the relative strengths and weaknesses of the entropy estimators
reviewed in the previous sections, we set out to benchmark them in a simulation study
covering different data generation processes and sampling regimes.

4.1 Simulation Setup 

We compared the statistical performance of all nine described estimators (maximum
likelihood, Miller-Madow, four Bayesian estimators, the proposed shrinkage estimator
(Eqs. 4 to 6), NSB and Chao-Shen) under various sampling and data generating scenarios:

• The dimension was fixed at p = 1000. 

• The sample size n took the values 10, 30, 100, 300, 1000, 3000, and 10000. That is, we
investigate cases of dramatic undersampling ("small n, large p") as well as situations
with a larger number of observed counts.

The true cell probabilities $\theta_1, \ldots, \theta_{1000}$ were assigned in four different fashions,
corresponding to rows 1-4 in Fig. 1:

1. Sparse and heterogeneous, following a Dirichlet distribution with parameter a =
0.0007,

2. Random and homogeneous, following a Dirichlet distribution with parameter 
a = 1, 

3. As in scenario 2, but with half of the cells containing structural zeros, and 

4. Following a Zipf-type power law. 

For each sampling scenario and sample size, we conducted 1000 simulation runs. In
each run, we generated a new set of true cell frequencies and subsequently sampled
observed counts from the corresponding multinomial distribution. The resulting
counts $y_k$ were then supplied to the various entropy and cell frequency estimators and
the squared error $\sum_{k=1}^{1000} (\theta_k - \hat\theta_k)^2$ computed. From the 1000 repetitions we estimated
the mean squared error (MSE) of the cell frequencies by averaging over the individual
squared errors (except for the NSB, Miller-Madow, and Chao-Shen estimators). Similarly,
we computed estimates of MSE and bias of the inferred entropies.
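
A scaled-down version of this protocol can be sketched as follows. It is not the study itself: it covers only scenario 2 (Dirichlet with a = 1), compares only the ML and shrinkage entropy estimates, and uses 20 runs instead of 1000 to keep the run time short. All function names, the seed and the default arguments are hypothetical.

```python
import math
import random

def entropy(freqs):
    """Shannon entropy in nats with the convention 0 log 0 = 0."""
    return -sum(f * math.log(f) for f in freqs if f > 0)

def freqs_ml(counts):
    n = sum(counts)
    return [y / n for y in counts]

def freqs_shrink(counts):
    """Shrinkage frequencies towards the uniform target (Eqs. 4 and 5)."""
    n, p = sum(counts), len(counts)
    u = freqs_ml(counts)
    misfit = sum((1.0 / p - f) ** 2 for f in u)
    lam = 1.0                                   # assumption: full shrinkage in degenerate cases
    if n > 1 and misfit > 0.0:
        lam = min(1.0, (1.0 - sum(f * f for f in u)) / ((n - 1) * misfit))
    return [lam / p + (1.0 - lam) * f for f in u]

def draw_dirichlet(p, a, rng):
    """Symmetric Dirichlet draw via normalized gamma variates."""
    g = [rng.gammavariate(a, 1.0) for _ in range(p)]
    total = sum(g)
    return [x / total for x in g]

def draw_counts(theta, n, rng):
    """Multinomial counts obtained by sampling n cell labels with probabilities theta."""
    counts = [0] * len(theta)
    for k in rng.choices(range(len(theta)), weights=theta, k=n):
        counts[k] += 1
    return counts

def simulate(p=1000, n=30, runs=20, a=1.0, seed=1):
    """Monte Carlo estimate of the entropy MSE for the ML and shrinkage estimators."""
    rng = random.Random(seed)
    sq_err = {"ML": 0.0, "shrink": 0.0}
    for _ in range(runs):
        theta = draw_dirichlet(p, a, rng)       # new true frequencies in every run
        counts = draw_counts(theta, n, rng)
        h_true = entropy(theta)
        sq_err["ML"] += (entropy(freqs_ml(counts)) - h_true) ** 2
        sq_err["shrink"] += (entropy(freqs_shrink(counts)) - h_true) ** 2
    return {name: total / runs for name, total in sq_err.items()}

mse = simulate()
print(mse)
```

With n = 30 observations spread over p = 1000 cells, the ML estimate cannot exceed log(30), far below the true entropy of this scenario, so the gap between the two estimators is large in this regime.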

4.2 Summary of Results from Simulations 

Fig. 1 displays the results of the simulation study, which can be summarized as follows:

• Unsurprisingly, all estimators perform well when the sample size is large. 

• The maximum likelihood and Miller-Madow estimators perform worst, except for 
scenario 1. Note that these estimators are inappropriate even for moderately large 
sample sizes. Furthermore, the bias correction of the Miller-Madow estimator is 
not particularly effective. 

• The minimax and 1/p Bayesian estimators tend to perform slightly better than
maximum likelihood, but not by much.

• The Bayesian estimators with pseudocounts 1/2 and 1 perform very well even 
for small sample sizes in the scenarios 2 and 3. However, they are less efficient in 
scenario 4, and completely fail in scenario 1. 

• Hence, the Bayesian estimators can perform better or worse than the ML estimator, 
depending on the choice of the prior and on the sampling scenario. 

• The NSB, the Chao-Shen and the shrinkage estimator all are statistically very 
efficient with small MSEs in all four scenarios, regardless of sample size. 

• The NSB and Chao-Shen estimators are nearly unbiased in scenario 3. 

The three top-performing estimators are the NSB, the Chao-Shen and the proposed
shrinkage estimator. When it comes to estimating the entropy, these estimators can be
considered identical for practical purposes. However, the shrinkage estimator is the only
one that simultaneously estimates cell frequencies suitable for use with the Shannon
entropy formula (Eq. 1), and it does so with high accuracy even for small samples. In
comparison, the NSB estimator is by far the slowest method: in our simulations, the
shrinkage estimator was faster by a factor of 1000.

Figure 1: Comparing the performance of nine different entropy estimators (maximum
likelihood, Miller-Madow, four Bayesian estimators, the proposed shrinkage estimator,
NSB and Chao-Shen) in four different sampling scenarios (rows 1 to 4). The estimators
are compared in terms of MSE of the underlying cell frequencies (except for Miller-Madow,
NSB, Chao-Shen) and according to MSE and Bias of the estimated entropies.
The dimension is fixed at p = 1000 while the sample size n varies from 10 to 10000.



5 Application to Statistical Learning of Nonlinear Gene Association Networks

In this section we illustrate how the shrinkage entropy estimator can be applied to 
the problem of inferring regulatory interactions between genes through estimating the 
nonlinear association network. 



5.1 From Linear to Nonlinear Gene Association Networks 



One of the aims of systems biology is to understand the interactions among genes and
their products underlying the molecular mechanisms of cellular function as well as
how disrupting these interactions may lead to different pathologies. To this end, an
extensive literature on the problem of gene regulatory network "reverse engineering"
has developed in the past decade (Friedman 2004). Starting from gene expression
or proteomics data, different statistical learning procedures have been proposed to
infer associations and dependencies among genes. Among many others, methods
have been proposed to enable the inference of large-scale correlation networks (Butte
et al. 2000) and of high-dimensional partial correlation graphs (Dobra et al. 2004;
Schäfer and Strimmer 2005a; Meinshausen and Bühlmann 2006), for learning
vector-autoregressive (Opgen-Rhein and Strimmer 2007a) and state space models (Rangel
et al. 2004; Lähdesmäki and Shmulevich 2008), and to reconstruct directed "causal"
interaction graphs (Kalisch and Bühlmann 2007; Opgen-Rhein and Strimmer 2007b).



The restriction to linear models in most of the literature is owed at least in part 
to the already substantial challenges involved in estimating linear high-dimensional 
dependency structures. However, cell biology offers numerous examples of threshold 
and saturation effects, suggesting that linear models may not be sufficient to model 
gene regulation and gene-gene interactions. In order to relax the linearity assumption 
and to capture nonlinear associations among genes, entropy-based network modeling 
was recently proposed in the form of the ARACNE (Margolin et al. 2006) and MRNET
(Meyer et al. 2007) algorithms.

The starting point of these two methods is to compute the mutual information
MI(X, Y) for all pairs of genes X and Y, where X and Y represent, for instance, the
expression levels of the two genes. The mutual information is the Kullback-Leibler distance
from the joint probability density to the product of the marginal probability densities:

$$\mathrm{MI}(X, Y) = E_{f(x,y)} \left\{ \log \frac{f(x,y)}{f(x) f(y)} \right\}. \tag{7}$$

The mutual information (MI) is always non-negative, symmetric, and equals zero only if
X and Y are independent. For normally distributed variables the mutual information is
closely related to the usual Pearson correlation,

$$\mathrm{MI}(X, Y) = -\frac{1}{2} \log(1 - \rho^2).$$

Therefore, mutual information is a natural measure of the association between genes,
regardless of whether it is linear or nonlinear in nature.



5.2 Estimation of Mutual Information 

To construct an entropy network, we first need to estimate mutual information for all
pairs of genes. The entropy representation

$$\mathrm{MI}(X, Y) = H(X) + H(Y) - H(X, Y) \tag{8}$$

shows that MI can be computed from the joint and marginal entropies of the two genes
X and Y. Note that this definition is equivalent to the one given in Eq. 7, which is based
on the Kullback-Leibler divergence. From Eq. 8 it is also evident that MI(X, Y) is the
information shared between the two variables.
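
The equivalence of the two definitions follows by expanding the logarithm of the ratio. A brief sketch for the discrete case, where $f$ denotes a probability mass function:

$$\mathrm{MI}(X, Y) = E_{f(x,y)}\{\log f(x,y)\} - E_{f(x,y)}\{\log f(x)\} - E_{f(x,y)}\{\log f(y)\}.$$

Because $\log f(x)$ depends on $x$ only, its expectation under the joint distribution equals its expectation under the marginal, $E_{f(x)}\{\log f(x)\} = -H(X)$, and likewise for $y$. The first term equals $-H(X, Y)$, so that

$$\mathrm{MI}(X, Y) = -H(X, Y) + H(X) + H(Y),$$

which is the entropy representation of Eq. 8.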

For gene expression data the estimation of MI and the underlying entropies is 
challenging due to the small sample size, which requires the use of a regularized entropy 
estimator such as the shrinkage approach we propose here. Specifically, we proceed as 
follows: 

• As a prerequisite the data must be discrete, with each measurement assuming
one of K levels. If the data are not already discretized, we propose employing the
simple algorithm of Freedman and Diaconis (1981), considering the measurements
of all genes simultaneously.

• Next, we estimate the $p = K^2$ cell frequencies of the K x K contingency table for
each pair X and Y using the shrinkage approach (Eqs. 4 and 5). Note that typically
the sample size n is much smaller than $K^2$, thus simple approaches such as ML are
not valid.

• Finally, from the estimated cell frequencies we calculate H(X), H(Y), H(X, Y) and
the desired MI(X, Y).
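
The last two steps of this procedure can be sketched as follows, assuming the data have already been discretized into integer levels 0, ..., K-1. The function names and the toy data are hypothetical. In this sketch the marginal frequencies are obtained by summing the shrunken joint frequencies over rows and columns, which is one way of reading the third step and keeps the resulting mutual information non-negative.

```python
import math

def entropy(freqs):
    """Shannon entropy in nats with the convention 0 log 0 = 0."""
    return -sum(f * math.log(f) for f in freqs if f > 0)

def freqs_shrink(counts):
    """Shrinkage frequencies towards the uniform target (Eqs. 4 and 5)."""
    n, p = sum(counts), len(counts)
    u = [y / n for y in counts]
    misfit = sum((1.0 / p - f) ** 2 for f in u)
    lam = 1.0                                   # assumption: full shrinkage in degenerate cases
    if n > 1 and misfit > 0.0:
        lam = min(1.0, (1.0 - sum(f * f for f in u)) / ((n - 1) * misfit))
    return [lam / p + (1.0 - lam) * f for f in u]

def contingency(x_levels, y_levels, k):
    """K x K table of joint counts for two discretized expression profiles."""
    table = [[0] * k for _ in range(k)]
    for a, b in zip(x_levels, y_levels):
        table[a][b] += 1
    return table

def mi_shrink(table):
    """Shrinkage estimate of MI(X, Y) = H(X) + H(Y) - H(X, Y) from a table of joint counts."""
    n_cols = len(table[0])
    joint = freqs_shrink([c for row in table for c in row])
    rows = [joint[i:i + n_cols] for i in range(0, len(joint), n_cols)]
    fx = [sum(r) for r in rows]                 # marginal frequencies of X
    fy = [sum(col) for col in zip(*rows)]       # marginal frequencies of Y
    return entropy(fx) + entropy(fy) - entropy(joint)

# Toy example: two genes measured at n = 9 time points, discretized into K = 4 levels.
x = [0, 0, 1, 1, 2, 2, 3, 3, 3]
y = [0, 1, 1, 1, 2, 3, 3, 3, 2]
print(mi_shrink(contingency(x, y, 4)))
```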



5.3 Mutual Information Network for E. coli Stress Response Data



For illustration, we now analyze data from Schmidt-Heck et al. (2004) who conducted an
experiment to observe the stress response in E. coli during expression of a recombinant
protein. This data set was also used in previous linear network analyses, for example,
in Schäfer and Strimmer (2005b). The raw data consist of 4289 protein coding genes,
on which measurements were taken at 0, 8, 15, 22, 45, 68, 90, 150, and 180 minutes. We
focus on a subset of G = 102 differentially expressed genes as given in Schmidt-Heck
et al. (2004).

Figure 2: Left: Distribution of estimated mutual information values for all 5151 gene pairs
of the E. coli data set. Right: Mutual information values after applying the ARACNE
gene pair selection procedure. Note that most MIs have been set to zero by the
ARACNE algorithm.



Discretization of the data according to Freedman and Diaconis (1981) yielded K = 16 
distinct gene expression levels. From the G = 102 genes, we estimated MIs for 5151
pairs of genes. For each pair, the mutual information was based on an estimated 16 x 16 
contingency table, hence p = 256. As the number of time points is n = 9, this is a 
strongly undersampled situation which requires the use of a regularized estimate of 
entropy and mutual information. 

The distribution of the shrinkage estimates of mutual information for all 5151 gene
pairs is shown on the left side of Fig. 2. The right hand side depicts the distribution of
mutual information values after applying the ARACNE procedure, which yields 112
gene pairs with nonzero MIs.

The model selection provided by ARACNE is based on applying the data
processing inequality to all gene triplets. For each triplet, the gene pair corresponding
to the smallest MI is discarded, which has the effect of removing gene-gene links that
correspond to indirect rather than direct interactions. This is similar to a procedure
used in graphical Gaussian models where correlations are transformed into partial
correlations. Thus, both the ARACNE and the MRNET algorithms can be considered
as devices to approximate the conditional mutual information (Meyer et al. 2007). As
a result, the 112 nonzero MIs recovered by the ARACNE algorithm correspond to
statistically detectable direct associations.

Figure 3: Mutual information network for the E. coli data inferred by the ARACNE
algorithm based on shrinkage estimates of entropy and mutual information.
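
The triplet-based pruning described above can be sketched as follows. This is a simplified illustration, not the ARACNE software: it omits the MI threshold and the tolerance parameter of the published algorithm, it removes an edge only if its MI is strictly smaller than both other MIs in some triplet, and the function name and toy matrix are hypothetical.

```python
from itertools import combinations

def aracne_prune(mi):
    """Zero out, for every gene triplet, the edge with the strictly smallest MI.

    mi is a symmetric G x G matrix (list of lists) of pairwise MI estimates.
    Edges are first marked against the original MI values and removed afterwards,
    so the result does not depend on the order in which triplets are visited.
    """
    g = len(mi)
    marked = set()
    for i, j, k in combinations(range(g), 3):
        edges = [(i, j), (i, k), (j, k)]
        for e in edges:
            others = [mi[a][b] for (a, b) in edges if (a, b) != e]
            if mi[e[0]][e[1]] < min(others):
                marked.add(e)
    pruned = [row[:] for row in mi]
    for a, b in marked:
        pruned[a][b] = pruned[b][a] = 0.0
    return pruned

# Toy example: gene 0 and gene 2 are linked only indirectly through gene 1.
mi = [[0.0, 0.8, 0.3],
      [0.8, 0.0, 0.6],
      [0.3, 0.6, 0.0]]
print(aracne_prune(mi))
```

For G = 102 genes this loop visits 171700 triplets, which is small enough for a direct enumeration.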

The corresponding gene association network is depicted in Fig. 3. The most striking
feature of the graph is the set of "hubs" belonging to genes hupB, sucA and nuoL. hupB
is a well known DNA-binding transcriptional regulator, whereas both nuoL and sucA
are key components of the E. coli metabolism. Note that a Lasso-type procedure (that
implicitly limits the number of edges that can connect to each node) such as that of
Meinshausen and Bühlmann (2006) cannot recover these hubs.



6 Discussion 



We proposed a James-Stein-type shrinkage estimator for inferring entropy and mutual 
information from small samples. While this is a challenging problem, we showed 
that our approach is highly efficient both statistically and computationally despite its 
simplicity. 

In terms of versatility, our estimator has two distinct advantages over the NSB and
Chao-Shen estimators. First, in addition to estimating the entropy, it also provides the
underlying multinomial frequencies for use with the Shannon formula (Eq. 1). This
is useful, for instance, in the context of using mutual information to quantify non-linear
pairwise dependencies. Second, unlike NSB, it is a fully analytic estimator.

Hence, our estimator suggests itself for applications in large scale estimation prob- 
lems. To demonstrate its application in the context of genomics and systems biology, we 
have estimated an entropy-based gene dependency network from expression data in 
E. coli. This type of approach may prove helpful to overcome the limitations of linear 
models currently used in network analysis. 

In short, we believe the proposed small-sample entropy estimator will be a valuable 
contribution to the growing toolbox of machine learning and statistics procedures for 
high-dimensional data analysis. 



Appendix A: Recipe For Constructing James-Stein-type Shrinkage Estimators



The original James-Stein estimator (James and Stein 1961) was proposed to estimate the
mean of a multivariate normal distribution from a single (n = 1!) vector observation.
Specifically, if $x$ is a sample from $N_p(\mu, I)$ then the James-Stein estimator is given by

$$\hat\mu_k^{JS} = \left(1 - \frac{p-2}{\sum_{j=1}^{p} x_j^2}\right) x_k.$$

Intriguingly, this estimator outperforms the maximum likelihood estimator $\hat\mu_k^{ML} = x_k$ in
terms of mean squared error if the dimension is $p \geq 3$. Hence, the James-Stein estimator
dominates the maximum likelihood estimator.

The above estimator can be slightly generalized by shrinking towards the component
average $\bar{x} = \frac{1}{p} \sum_{k=1}^{p} x_k$ rather than to zero, resulting in

$$\hat\mu_k^{Shrink} = \hat\lambda^{\star} \bar{x} + (1 - \hat\lambda^{\star}) x_k$$

with estimated shrinkage intensity

$$\hat\lambda^{\star} = \frac{p-3}{\sum_{k=1}^{p} (x_k - \bar{x})^2}.$$

The James-Stein shrinkage principle is very general and can be put to use in
many other high-dimensional settings. In the following we summarize a simple recipe
for constructing James-Stein-type shrinkage estimators along the lines of Schäfer and
Strimmer (2005b) and Opgen-Rhein and Strimmer (2007a).

In short, there are two key ideas at work in James-Stein shrinkage:

i) regularization of a high-dimensional estimator $\hat\theta$ by linear combination with a
lower-dimensional target estimate $\hat\theta^{Target}$, and

ii) adaptive estimation of the shrinkage parameter $\lambda$ from the data by quadratic risk
minimization.



A general form of a James-Stein-type shrinkage estimator is given by

$$\hat\theta^{Shrink} = \lambda \hat\theta^{Target} + (1 - \lambda) \hat\theta. \tag{9}$$

Note that $\hat\theta$ and $\hat\theta^{Target}$ are two very different estimators (for the same underlying model!).
$\hat\theta$, as a high-dimensional estimate with many independent components, has low bias but
for small samples a potentially large variance. In contrast, the target estimate $\hat\theta^{Target}$ is
low-dimensional and therefore is generally less variable than $\hat\theta$ but at the same time is
also more biased. The James-Stein estimate is a weighted average of these two estimators,
where the weight is chosen in a data-driven fashion such that $\hat\theta^{Shrink}$ is improved in
terms of mean squared error relative to both $\hat\theta$ and $\hat\theta^{Target}$.

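A key advantage of James-Stein-type shrinkage is that the optimal shrinkage intensity
$\lambda^{\star}$ can be calculated analytically and without knowing the true value $\theta$, via

$$\lambda^{\star} = \frac{\sum_{k=1}^{p} \left[ \mathrm{Var}(\hat\theta_k) - \mathrm{Cov}(\hat\theta_k, \hat\theta_k^{Target}) + \mathrm{Bias}(\hat\theta_k) \, E(\hat\theta_k - \hat\theta_k^{Target}) \right]}{\sum_{k=1}^{p} E\left[ (\hat\theta_k - \hat\theta_k^{Target})^2 \right]}. \tag{10}$$

A simple estimate of $\lambda^{\star}$ is obtained by replacing all variances and covariances in Eq. 10
with their empirical counterparts, followed by truncation of $\hat\lambda^{\star}$ at 1 (so that $\hat\lambda^{\star} \leq 1$
always holds).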
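
The origin of Eq. 10 can be sketched in a few lines by minimizing the quadratic risk of the shrinkage estimator with respect to $\lambda$. Writing $d_k = \hat\theta_k - \hat\theta_k^{Target}$, so that $\hat\theta_k^{Shrink} = \hat\theta_k - \lambda d_k$, the risk is

$$R(\lambda) = \sum_{k=1}^{p} E\left[ (\hat\theta_k - \theta_k - \lambda d_k)^2 \right] = \sum_{k=1}^{p} \left\{ E\left[(\hat\theta_k - \theta_k)^2\right] - 2\lambda \, E\left[(\hat\theta_k - \theta_k) d_k\right] + \lambda^2 E\left[d_k^2\right] \right\}.$$

Setting $dR/d\lambda = 0$ yields

$$\lambda^{\star} = \frac{\sum_{k=1}^{p} E\left[(\hat\theta_k - \theta_k) d_k\right]}{\sum_{k=1}^{p} E\left[d_k^2\right]}, \qquad E\left[(\hat\theta_k - \theta_k) d_k\right] = \mathrm{Cov}(\hat\theta_k, d_k) + \mathrm{Bias}(\hat\theta_k) \, E(d_k),$$

and $\mathrm{Cov}(\hat\theta_k, d_k) = \mathrm{Var}(\hat\theta_k) - \mathrm{Cov}(\hat\theta_k, \hat\theta_k^{Target})$ gives the numerator of Eq. 10.

For the cell frequencies, $\hat\theta_k = \hat\theta_k^{ML}$ is unbiased and the target $t_k$ is fixed, so the covariance and bias terms vanish. Replacing the remaining quantities by their empirical counterparts, and using $\sum_k \hat\theta_k^{ML} = 1$, leads to

$$\hat\lambda^{\star} = \frac{\sum_{k=1}^{p} \hat\theta_k^{ML}(1 - \hat\theta_k^{ML})/(n-1)}{\sum_{k=1}^{p} (t_k - \hat\theta_k^{ML})^2} = \frac{1 - \sum_{k=1}^{p} (\hat\theta_k^{ML})^2}{(n-1) \sum_{k=1}^{p} (t_k - \hat\theta_k^{ML})^2},$$

which is the shrinkage intensity used for the entropy estimator in Section 3.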



Eq. 10 is discussed in detail in Schäfer and Strimmer (2005b) and Opgen-Rhein and
Strimmer (2007a). More specialized versions of it are treated, for example, in Ledoit
and Wolf (2003) for unbiased $\hat\theta$ and in Thompson (1968) (unbiased, univariate case with
deterministic target). A very early version (univariate with zero target) even predates
the estimator of James and Stein, see Goodman (1953). For the multinormal setting of
James and Stein (1961), Eq. 9 and Eq. 10 reduce to the shrinkage estimator described in
Stigler (1990).



James-Stein shrinkage has an empirical Bayes interpretation (Efron and Morris 1973).
Note, however, that only the first two moments of the distributions of $\hat\theta^{Target}$ and $\hat\theta$
need to be specified in Eq. 10. Hence, James-Stein estimation may be viewed as a
quasi-empirical Bayes approach (in the same sense as in quasi-likelihood, which also requires
only the first two moments).



Appendix B: Computer Implementation 

The proposed shrinkage estimators of entropy and mutual information, as well as
all other investigated entropy estimators, have been implemented in R (R Development
Core Team 2008). A corresponding R package "entropy" was deposited in
the R archive CRAN and is accessible at the URL http://cran.r-project.org/web/packages/entropy/
under the GNU General Public License.